Memory Access Patterns on Architectures with Local Memory
Authors
Abstract
Modern architectures and the implementations of their programming models are becoming increasingly complex and diverse, so the performance benefit of using local memory can no longer be predicted by simplistic modeling alone. In this paper, we present a benchmark-based approach to tackle this issue. We first present a two-part approach to describing memory access patterns (MAPs) for many-thread applications. For each MAP, we design benchmarks in a native version (without local memory) and an optimized version (using local memory). We then evaluate them on commonly used platforms (NVIDIA GTX280, NVIDIA GTX580, AMD HD6970, and Intel E5620), compare the performance of the native and optimized versions, and collect the results in a performance database. This database can provide essential information for the automated use of local memory.
Similar resources
FPGA Implementation of a Hammerstein Based Digital Predistorter for Linearizing RF Power Amplifiers with Memory Effects
Power amplifiers (PAs) are inherently nonlinear elements and digital predistortion is a highly cost-effective approach to linearize them. Although most existing architectures assume that the PA has a memoryless nonlinearity, memory effects of the PAs in many applications, such as wideband code-division multiple access (WCDMA) or orthogonal frequency-division multiplexing (OFDM), can no longer b...
Aristotle: A performance impact indicator for the OpenCL kernels using local memory
Due to the increasing complexity of multi/manycore architectures (with their mix of caches and scratch-pad memories) and applications (with different memory access patterns), the performance of many workloads becomes increasingly variable. In this work, we address one of the main causes for this performance variability: the efficiency of the memory system. Specifically, based on an empirical ev...
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
The manycore revolution can be characterized by increasing thread counts, decreasing memory per thread, and diversity of continually evolving manycore architectures. High performance computing (HPC) applications and libraries must exploit increasingly finer levels of parallelism within their codes to sustain scalability on these devices. A major obstacle to performance portability is the diverse a...
Design and Evaluation of Data Access Prediction Strategies in SDSM Systems
Software Distributed Shared Memory (SDSM) systems provide the shared memory abstraction on top of message-passing hardware, simplifying application programming in these architectures. However, some memory references exhibit long latencies due to remotely cached data. In order to hide this latency, many techniques that propagate data speculatively were developed. This requires that the data ac...
Memory Latency in Distributed Shared-Memory Multiprocessors
Analytical models were developed and simulations of memory latency were performed for Uniform Memory Access (UMA), Non-Uniform Memory Access (NUMA), Local-Remote-Global (LRG), and Replicated Concurrent-Read (RCR) architectures for hit rates from 0.1 to 0.9 in steps of 0.1, memory access times of 10 nsec to 100 nsec, proportions of read/write access from 0.01 to 0.1, and block sizes of 8 to ...